Red Wine Quality by Xueming Li

This report explores a data set containing quality scores and attributes for approximately 1600 wines.

The data set has 1599 wine records, with 12 variables for each record. According to the summary of the dataset, column X is just the row number and carries no information, so it is omitted. Wine quality is scored on a scale from 0 to 10, so it is converted to an ordered factor.

## [1] 1599   12
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 53  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20   5:681  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42   6:638  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   7:199  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90   8: 18
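The preparation described above can be sketched in R roughly as follows. This is a minimal illustration, not the code used for the report: the `raw` data frame is a hypothetical stand-in for the raw file, built from three rows (110, 355 and 1080) that appear in tables later in this report and reduced to two variables plus the `X` row-number column.

```r
# Hypothetical stand-in for the raw data: three rows taken from tables in this report.
raw <- data.frame(
  X       = c(110, 355, 1080),
  alcohol = c(9.3, 11.9, 12.3),
  quality = c(5, 6, 7)
)

# Drop the row-number column; it carries no information about the wine.
wine <- raw[, names(raw) != "X"]

# Treat quality as an ordered factor so that 3 < 4 < ... < 8 is preserved.
wine$quality <- factor(wine$quality, levels = 3:8, ordered = TRUE)

dim(wine)
summary(wine$quality)
```

Setting `levels = 3:8` keeps all six observed scores as levels even when a subset of the data does not contain every score.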

Univariate Plots Section

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Quality is the concern of this report, so I will start with the quality counts. Most of the wines scored 5 or 6; only 10 of them scored 3, and 18 of them scored 8. Although the scale allows scores up to 10, the observed scores range from 3 to 8, so in this analysis I take 3 as the worst quality, 8 as the best, and 5 and 6 as average.
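As a small sketch of how these counts can be tabulated, the snippet below rebuilds a quality vector from the six counts reported above and passes it to `table()`. The variable names (`scores`, `counts`, `share_avg`) are my own and not taken from the original analysis code.

```r
# Rebuild a quality vector from the counts reported above
# (10, 53, 681, 638, 199 and 18 wines for scores 3 to 8).
scores <- 3:8
counts <- c(10, 53, 681, 638, 199, 18)
quality <- factor(rep(scores, times = counts), levels = scores, ordered = TRUE)

# Frequency table of the quality scores.
quality_counts <- table(quality)
print(quality_counts)

# Share of wines scored 5 or 6.
share_avg <- sum(quality_counts[c("5", "6")]) / length(quality)
round(share_avg, 3)
```

With these counts, 1319 of the 1599 wines (roughly 82%) fall into the two average levels, which is why the extreme levels 3 and 8 need to be interpreted with care.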

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed.acidity distribution is slightly right-skewed (mean 8.32 vs. median 7.90), with most values between about 7 and 9 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Volatile acidity in the wine is generally much lower than fixed acidity. Too high a level of volatile acidity gives an unpleasant taste. As shown in this plot, most wines have a volatile acidity of around 0.52 g/dm^3, and the most extreme outlier is at 1.58 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The citric acid content mostly lies between 0 and 0.75 g/dm^3. The distribution is almost uniform over this range, and there are many wines with zero citric acid.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Most of the residual sugar values fall in the range 1.9 to 2.6 g/dm^3, with a median of 2.2 g/dm^3. According to the plots, the outliers have residual sugar above 8.0 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The chloride content is roughly normally distributed around 0.079 g/dm^3, although some wines have more than 0.2 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The distribution of free sulfur dioxide is skewed to the right, so I transform the x-axis with ‘sqrt’.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

After transforming the x-axis, the distribution looks clearer. Most of the free sulfur dioxide values are in the range 7 to 40 mg/dm^3. I wonder whether sulfur dioxide (SO2) affects the taste of the wine.

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1080           7.9              0.3        0.68            8.3      0.05
## 1082           7.9              0.3        0.68            8.3      0.05
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1080                37.5                  278 0.99316 3.01      0.51
## 1082                37.5                  289 0.99316 3.01      0.51
##      alcohol quality
## 1080    12.3       7
## 1082    12.3       7
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 110            8.1            0.785        0.52            2.0     0.122
## 355            6.1            0.210        0.40            1.4     0.066
## 516            8.5            0.655        0.49            6.1     0.122
## 652            9.8            0.880        0.25            2.5     0.104
## 673            9.8            1.240        0.34            2.0     0.079
## 685            9.8            0.980        0.32            2.3     0.078
## 1245           5.9            0.290        0.25           13.4     0.067
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 110                 37.0                  153 0.99690 3.21      0.69
## 355                 40.5                  165 0.99120 3.25      0.59
## 516                 34.0                  151 1.00100 3.31      1.14
## 652                 35.0                  155 1.00100 3.41      0.67
## 673                 32.0                  151 0.99800 3.15      0.53
## 685                 35.0                  152 0.99800 3.25      0.48
## 1245                72.0                  160 0.99721 3.33      0.54
##      alcohol quality
## 110      9.3       5
## 355     11.9       6
## 516      9.3       5
## 652     11.2       5
## 673      9.5       5
## 685      9.4       5
## 1245    10.3       6

The distribution of free sulfur dioxide is skewed to the right, with a median of 14.00 mg/dm^3, so I transformed the x-axis with log10(). After the transformation, the distribution looks approximately normal.

There are two outliers with total sulfur dioxide > 200 and 7 more samples > 150. All of them have a quality of 5 to 7, which suggests that when total sulfur dioxide is high, the taste may be average or above. However, the sample is too small to draw any conclusion at this point.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02273 0.25930 0.37500 0.38230 0.48480 0.85710

The ratio of free sulfur dioxide to total sulfur dioxide is normally distributed. Most of the wines have a ratio between about 0.259 and 0.485.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The density of the wine is normally distributed around 0.9968 g/cm^3, and very few wines have a density higher than pure water (1 g/cm^3).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Most wines in this dataset have a pH between 3.21 and 3.40.

Sulphates are an additive acting as an antimicrobial and antioxidant in the wine. Most wines have sulphates between 0.5 and 0.8 g/dm^3. The distribution is slightly skewed to the right.

The alcohol content is slightly skewed to the right, with most wines between about 9% and 13% alcohol.

Univariate Analysis

What is the structure of your dataset?

The data set has 1599 wine records, with 12 variables for each record. The variable quality is an ordered factor with levels from 3 to 8 (a higher level means better taste).

Other observations are:

  • Most wines have a quality of 5 or 6
  • Most wines have a density lower than water

What is/are the main feature(s) of interest in your dataset?

  • Alcohol content
  • volatile acidity
  • citric acid
  • free sulfur dioxide / total sulfur dioxide

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • residual sugar
  • Sulphate
  • density

Did you create any new variables from existing variables in the dataset?

  • I investigated the ratio of free sulfur dioxide to total sulfur dioxide (slf_rate). Since too much free sulfur dioxide (SO2) might affect the taste, I will study this ratio as well.
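A minimal sketch of how this variable can be derived, assuming it is simply free divided by total sulfur dioxide. The three rows below (15, 585 and 1245) are copied from the table printed in the Bivariate Plots Section; the small `wine` data frame is only an illustration, not the full data set.

```r
# Three rows copied from the free sulfur dioxide table in the Bivariate Plots Section.
wine <- data.frame(
  free.sulfur.dioxide  = c(52, 54, 72),
  total.sulfur.dioxide = c(145, 80, 160),
  row.names            = c("15", "585", "1245")
)

# Ratio of free to total sulfur dioxide.
wine$slf_rate <- wine$free.sulfur.dioxide / wine$total.sulfur.dioxide

round(wine$slf_rate, 7)
```

The rounded results (0.3586207, 0.675 and 0.45) agree with the slf_rate column printed for the same rows, which supports the free / total definition.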

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • Quality column was changed to ordered factor variables.
  • Column X is the row number, so it’s removed.

Bivariate Plots Section

To get an initial sense of the relationships between the variables, I created a grid of pairwise plots. The abbreviations in the figure are as follows:

  • f_acid: fixed acidity (g/dm^3)
  • v_acid: volatile acidity (g/dm^3)
  • c_acid: citric acid (g/dm^3)
  • r_sg: residual sugar (g/dm^3)
  • fsd: free sulfur dioxide (mg/dm^3)
  • tsd: total sulfur dioxide (mg/dm^3)
  • den: density (g/cm^3)
  • slf: sulphates (g/dm^3)
  • pH: pH
  • alc: alcohol (%)
  • quality: quality (0-10)

According to the correlation coefficients, alcohol has the strongest correlation with quality, followed by volatile acidity, with correlation coefficients of 0.476 and -0.391 respectively. Sulphates and citric acid also show a weak correlation with quality.

In addition to these, citric acid is correlated with volatile acidity (-0.552), and density is correlated with alcohol (-0.496).
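The coefficients above can be computed along the lines of the sketch below. I am assuming Pearson correlation (the default of `cor()`). The seven rows are copied from the table of high total sulfur dioxide samples in the Univariate Plots Section, so the results will not reproduce 0.476 or -0.391; the point is only the mechanics, in particular converting the ordered factor back to numeric scores first.

```r
# Seven rows copied from the total sulfur dioxide table in the Univariate Plots Section.
wine <- data.frame(
  alcohol          = c(9.3, 11.9, 9.3, 11.2, 9.5, 9.4, 10.3),
  volatile.acidity = c(0.785, 0.210, 0.655, 0.880, 1.240, 0.980, 0.290),
  quality          = factor(c(5, 6, 5, 5, 5, 5, 6), levels = 3:8, ordered = TRUE)
)

# cor() needs numbers, so recover the scores from the factor labels
# (as.numeric() alone would return the level codes 1..6 instead).
quality_num <- as.numeric(as.character(wine$quality))

# Pearson correlation of each attribute with the numeric quality score.
cor_alcohol  <- cor(wine$alcohol, quality_num)
cor_volatile <- cor(wine$volatile.acidity, quality_num)
c(alcohol = cor_alcohol, volatile.acidity = cor_volatile)
```

Even on this tiny sample the signs match the full data set: positive for alcohol and negative for volatile acidity.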

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.230  12.600

The fixed acidity varies within each quality level, but wines of quality 3 to 8 all cover a similar range of fixed acidity, indicating that fixed acidity does not appear to be correlated with quality.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

The relationship between volatile acidity and quality is clear. Lower quality wines tend to have higher volatile acidity. The median volatile acidity of quality level 8 (0.370) is much lower than that of level 3 (0.845).

However, some high quality wines have a volatile acidity higher than the median value of level 3. Volatile acidity is a very important factor for the taste of wine, but high volatile acidity does not necessarily mean that a wine tastes bad.
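Per-level summaries like the ones above can be produced with `by()`, and the medians used in the comparison with `tapply()`. This is a sketch on seven rows copied from the total sulfur dioxide table in the Univariate Plots Section (only quality levels 5 and 6 occur there), so the numbers differ from the full-data output.

```r
# Seven rows copied from a table earlier in this report; not the full data set.
wine <- data.frame(
  volatile.acidity = c(0.785, 0.210, 0.655, 0.880, 1.240, 0.980, 0.290),
  quality          = factor(c(5, 6, 5, 5, 5, 5, 6), ordered = TRUE)
)

# Six-number summary per quality level, in the same layout as the output above.
by(wine$volatile.acidity, wine$quality, summary)

# Just the medians, which is what the comparison in the text relies on.
medians <- tapply(wine$volatile.acidity, wine$quality, median)
medians
```

Comparing medians rather than means keeps a single extreme wine from dominating the picture for the small groups at levels 3 and 8.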

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Though wines with zero citric acid can be found at every quality level, the plots show that wines with higher quality tend to have a higher citric acid content, indicating that citric acid is weakly correlated with quality.

Volatile acidity is correlated with citric acid. The higher citric acid the wine has, the less volatile acidity it has.

However, the ratio of citric acid to volatile acidity doesn’t show any trend with quality.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

Neither residual sugar nor chlorides appear to be correlated with quality.

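# bi_jitter() is a plotting helper whose definition is not shown in this report.
library(gridExtra)  # provides grid.arrange()
slfr_p <- bi_jitter("quality", "slf_rate")
grid.arrange(slfr_p)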

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 15             8.9            0.620        0.18            3.8     0.176
## 16             8.9            0.620        0.19            3.9     0.170
## 58             7.5            0.630        0.12            5.1     0.111
## 397            6.6            0.735        0.02            7.9     0.122
## 401            6.6            0.735        0.02            7.9     0.122
## 585           11.8            0.330        0.49            3.4     0.093
## 926            8.6            0.220        0.36            1.9     0.064
## 927            9.4            0.240        0.33            2.3     0.061
## 983            7.3            0.520        0.32            2.1     0.070
## 1132           5.9            0.190        0.21            1.7     0.045
## 1155           6.6            0.580        0.00            2.2     0.100
## 1245           5.9            0.290        0.25           13.4     0.067
## 1296           6.6            0.630        0.00            4.3     0.093
## 1297           6.6            0.630        0.00            4.3     0.093
## 1359           7.4            0.640        0.17            5.4     0.168
## 1435          10.2            0.540        0.37           15.4     0.214
## 1436          10.2            0.540        0.37           15.4     0.214
## 1559           6.9            0.630        0.33            6.7     0.235
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 15                    52                145.0 0.99860 3.16      0.88
## 16                    51                148.0 0.99860 3.17      0.93
## 58                    50                110.0 0.99830 3.26      0.77
## 397                   68                124.0 0.99940 3.47      0.53
## 401                   68                124.0 0.99940 3.47      0.53
## 585                   54                 80.0 1.00020 3.30      0.76
## 926                   53                 77.0 0.99604 3.47      0.87
## 927                   52                 73.0 0.99786 3.47      0.90
## 983                   51                 70.0 0.99418 3.34      0.82
## 1132                  57                135.0 0.99341 3.32      0.44
## 1155                  50                 63.0 0.99544 3.59      0.68
## 1245                  72                160.0 0.99721 3.33      0.54
## 1296                  51                 77.5 0.99558 3.20      0.45
## 1297                  51                 77.5 0.99558 3.20      0.45
## 1359                  52                 98.0 0.99736 3.28      0.50
## 1435                  55                 95.0 1.00369 3.18      0.77
## 1436                  55                 95.0 1.00369 3.18      0.77
## 1559                  66                115.0 0.99787 3.22      0.56
##      alcohol quality  slf_rate
## 15       9.2       5 0.3586207
## 16       9.2       5 0.3445946
## 58       9.4       5 0.4545455
## 397      9.9       5 0.5483871
## 401      9.9       5 0.5483871
## 585     10.7       7 0.6750000
## 926     11.0       7 0.6883117
## 927     10.2       6 0.7123288
## 983     12.9       6 0.7285714
## 1132     9.5       5 0.4222222
## 1155    11.4       6 0.7936508
## 1245    10.3       6 0.4500000
## 1296     9.5       5 0.6580645
## 1297     9.5       5 0.6580645
## 1359     9.5       5 0.5306122
## 1435     9.0       6 0.5789474
## 1436     9.0       6 0.5789474
## 1559     9.5       5 0.5739130

Nothing stands out for free sulfur dioxide or total sulfur dioxide, nor for the ratio of free to total sulfur dioxide. In theory, when free sulfur dioxide is 50 mg/dm^3 or more, the less total sulfur dioxide the wine has, the better its quality will be. However, only the 18 samples listed above meet this condition, which is not enough to support the conclusion.

There is no obvious relationship between sulphates and free or total sulfur dioxide. However, sulphates are correlated with quality: wines with higher sulphate content tend to get higher quality scores.
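The selection of high free sulfur dioxide wines can be sketched with `subset()`. The five rows below are copied from tables in this report (rows 110, 355, 1245, 58 and 15); the names `high_free` and `above_free` are my own. The example also shows why the choice between `>=` and `>` matters for the sample count, since some wines sit exactly at 50.

```r
# Five rows copied from tables in this report; not the full data set.
wine <- data.frame(
  free.sulfur.dioxide  = c(37.0, 40.5, 72.0, 50.0, 52.0),
  total.sulfur.dioxide = c(153, 165, 160, 110, 145),
  quality              = c(5, 6, 6, 5, 5),
  row.names            = c("110", "355", "1245", "58", "15")
)

# Wines at or above the 50 mg/dm^3 threshold.
high_free <- subset(wine, free.sulfur.dioxide >= 50)

# A strict comparison drops the wine sitting exactly at 50.
above_free <- subset(wine, free.sulfur.dioxide > 50)

nrow(high_free)
nrow(above_free)

# Order by total sulfur dioxide to compare it against quality by eye.
high_free[order(high_free$total.sulfur.dioxide), c("total.sulfur.dioxide", "quality")]
```

On the full data, the table above contains two wines with exactly 50 mg/dm^3, so the strict and non-strict thresholds give different sample sizes there as well.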

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9962  0.9976  0.9975  0.9988  1.0010 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9956  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0030 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0040 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0030 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.162   3.230   3.267   3.350   3.720

Neither density nor pH appears to be correlated with quality.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Alcohol shows the strongest correlation with quality. According to the plots, wines with higher alcohol tend to get higher quality scores. The median alcohol for level 8 (12.15) is much higher than that of level 3 (9.925).

The density is related to the residual sugar and alcohol. Higher residual sugar leads to higher density, while higher alcohol leads to lower density.
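One way to look at both effects on density at once is a linear model with residual sugar and alcohol as predictors. This is only a sketch and not part of the original analysis: the eight rows are copied from the free sulfur dioxide table above (rows 15, 58, 397, 585, 926, 983, 1132 and 1435), which is far too small a sample to trust the fitted coefficients.

```r
# Eight rows copied from the free sulfur dioxide table above; not the full data set.
wine <- data.frame(
  residual.sugar = c(3.8, 5.1, 7.9, 3.4, 1.9, 2.1, 1.7, 15.4),
  alcohol        = c(9.2, 9.4, 9.9, 10.7, 11.0, 12.9, 9.5, 9.0),
  density        = c(0.99860, 0.99830, 0.99940, 1.00020,
                     0.99604, 0.99418, 0.99341, 1.00369)
)

# Density modelled as a linear function of residual sugar and alcohol.
fit <- lm(density ~ residual.sugar + alcohol, data = wine)
coef(fit)
```

On the full data set, the statement above would correspond to a positive coefficient for residual.sugar and a negative one for alcohol.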

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • According to the correlation coefficients table, alcohol and volatile acidity are the factors most strongly correlated with quality, with correlation coefficients of 0.476 and -0.391 respectively. Sulphates and citric acid also show a weak correlation with quality.
  • Volatile acidity, the acetic acid in wine, tends to affect the taste in a negative way. Quality scores drop as volatile acidity rises.
  • Citric acid, on the contrary, adds freshness and flavor to wine and affects the taste in a positive way. Quality scores improve as citric acid rises.
  • Citric acid and volatile acidity are correlated. Wine with higher citric acid tends to have lower volatile acidity.
  • Higher sulphate or alcohol content goes with higher wine quality.
  • The total sulfur dioxide and sulphates don’t appear to be correlated with each other.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Density is correlated with both alcohol and residual sugar. Higher residual sugar leads to higher density, while higher alcohol leads to lower density.

What was the strongest relationship you found?

  • The relationship between alcohol and quality, with the highest correlation coefficient of 0.476. According to the box plot and scatter plot, the trend is very clear that wines with higher alcohol have higher quality.

Multivariate Plots Section

We can see that there are many wines with zero citric acid, so I extracted these samples to look at their quality. Most of these wines score 5 to 6, which is not surprising. In the volatile acidity vs. quality plot for these samples, the trend that high volatile acidity goes with lower quality is also clear, which indicates that volatile acidity is strongly correlated with quality.

Citric acid appears to be linearly correlated with volatile acidity. Wine with higher citric acid tends to have lower volatile acidity. However, the ratio of citric acid to volatile acidity doesn’t show any relationship with quality.

Overall, higher sulphate content goes with better taste; however, the difference in sulphate content between good and bad quality wines doesn’t stand out.

Residual sugar and alcohol are both correlated with the density of the wine. However, alcohol differs more clearly between wine qualities.

Alcohol and volatile acidity are the factors most strongly correlated with quality; however, these two factors show no clear trend with each other. The plot also shows that most dark dots (high quality) sit in the area of high alcohol and low volatile acidity, while light dots (low quality) sit in the area of low alcohol and high volatile acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • The correlation between volatile acidity and wine quality was strengthened: even when citric acid is zero, the trend that higher volatile acidity goes with lower wine quality is still clear.

Were there any interesting or surprising interactions between features?

  • Citric acid appears to be linearly correlated with volatile acidity. Wine with higher citric acid tends to have lower volatile acidity.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Most of the wines got a score of 5 or 6, while there are also a few low scores of 3 and high scores of 8.

Plot Two

Description Two

Volatile acidity is strongly correlated with quality: wines with higher volatile acidity tend to have lower quality. The trend is also shown by the red line representing the median value of each quality group.

Plot Three

Description Three

Alcohol and volatile acidity are the factors most strongly correlated with quality; however, these two factors show no clear trend with each other. The plot also shows that most dark dots (high quality) sit in the area of high alcohol and low volatile acidity, while light dots (low quality) sit in the area of low alcohol and high volatile acidity.


Reflection

The data set has 1599 wine records, with 12 variables for each record. I started exploring the data by plotting the distributions, then investigated the relationships between the main features. Eventually I studied the three major attributes affecting the wine taste.

There is a clear trend between alcohol and quality, and between volatile acidity and quality.

There are also some limitations to this dataset. In theory, when free sulfur dioxide is higher than 50 ppm, high total sulfur dioxide affects the taste of the wine. This can be seen in the data set; however, there are only 18 wine samples with free sulfur dioxide of 50 ppm or more, which is not enough to prove the theory.